Model Fitting

Given that we can simply ask the simulation for more samples, I decided not to use a typical train/test split. Rather, I will use all samples for training and optimize based on cross-validation, giving every record a turn in both the training set and the test set. Once the model is fitted, we can ask the simulation for additional samples (say 1000) to use as a completely independent test set.
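
As a sketch of this setup (toy stand-in data; the real features and labels come from the simulation), scikit-learn's `cross_val_score` gives every record a turn in a held-out fold:

```python
# Cross-validation over ALL samples, as described above.
# X and y here are random stand-ins for the simulated data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # stand-in for the simulated features
y = rng.integers(0, 2, size=200)     # stand-in labels: 1 = SAS, 0 = neutral

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# 5-fold CV: every record appears in the training folds 4 times
# and in the held-out fold exactly once.
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

A fresh batch of simulated samples can then be scored with the fitted model as the independent test set.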

Model Evaluations

Many metrics can be used to evaluate models; those I calculate here are:

  1. Accuracy: (TP + TN) / total, the fraction of samples the RF model classifies correctly
  2. Error Rate: 1 - Accuracy, the fraction of samples the RF model classifies incorrectly
  3. True Positive Rate (TPR) | Sensitivity | Recall | Coverage: TP / (TP + FN), fraction of SAS examples correctly predicted
    • There is typically a trade-off between Recall and the Precision defined below
  4. True Negative Rate (TNR) | Specificity: TN / (FP + TN), fraction of neutral examples correctly predicted
  5. False Positive Rate (FPR): FP / (FP + TN), fraction of neutral examples predicted as having SAS (really bad)
  6. False Negative Rate (FNR): FN / (TP + FN), fraction of SAS examples predicted as neutral (not as bad, but still bad)
  7. Precision: TP / (TP + FP), fraction of samples that actually have SAS out of total samples predicted to have SAS
    • Precision addresses the question: "Given a sample predicted to have SAS, how likely is it to be correct?"
    • We may want to sacrifice Recall in order to achieve a high Precision
  8. F-measure: $\frac{2 \cdot precision \cdot recall}{precision + recall}$, the harmonic mean of precision and recall (a harmonic mean is closer to the smaller of its inputs, so the F-measure sits closer to whichever of precision or recall is smaller in magnitude)
    • Ideally, the F-measure should be high, indicating that both precision and recall are high
  9. Area Under the Curve (AUC): the area under the Receiver Operating Characteristic (ROC) curve, which demonstrates the sensitivity-specificity trade-off by plotting TPR against FPR
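
For reference, the metrics above can be computed directly from the four confusion-matrix counts (toy counts for illustration; in practice these come from the cross-validated predictions):

```python
# The metrics listed above, from raw confusion-matrix counts.
TP, FP, TN, FN = 40, 5, 45, 10   # toy counts

accuracy  = (TP + TN) / (TP + FP + TN + FN)
error     = 1 - accuracy
tpr       = TP / (TP + FN)        # recall / sensitivity / coverage
tnr       = TN / (FP + TN)        # specificity
fpr       = FP / (FP + TN)
fnr       = FN / (TP + FN)
precision = TP / (TP + FP)
f_measure = 2 * precision * tpr / (precision + tpr)
print(accuracy, precision, tpr, f_measure)
```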

Confusion Matrix

(figure: confusion matrix)

Alternative Metric: Cost Matrix

Sometimes, particularly in the health sciences, we want to punish or reward the model more for some outcomes than for others. For example, in cancer prediction, more emphasis is placed on avoiding False Negatives (failing to detect the cancer), so we may wish to assign costs/weights (negative meaning reward) to TP, FP, TN, and FN like this:

(figure: cost matrix with example costs/weights for TP, FP, TN, and FN)

This can be implemented as an alternative to the above metrics during cross validation. Alternatively, the cost matrix can be used to classify one particular record; that is, we can use the cost matrix to evaluate risk.

With a RandomForest, I am able to extract the probability of a sample showing SAS or not, for example:

P(SAS) = 0.2, P(neutral, other) = 0.8

Given the above cost matrix, the expected cost of each possible prediction can then be computed by weighting the costs by these probabilities, and the prediction with the lower expected cost chosen.
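
A minimal sketch of this cost-weighted decision, assuming scikit-learn-style class probabilities; the cost values below are illustrative stand-ins, not the actual matrix from the figure:

```python
# Cost-sensitive decision from RF class probabilities.
import numpy as np

# cost[predicted, actual]; index 0 = neutral, 1 = SAS.
# A negative entry rewards a correct call; FN (predict neutral,
# actual SAS) is penalized most heavily, as in the cancer example.
cost = np.array([[0.0, 10.0],    # predict neutral: TN cost, FN cost
                 [1.0, -1.0]])   # predict SAS:     FP cost, TP cost

p = np.array([0.8, 0.2])         # [P(neutral), P(SAS)] from predict_proba
expected_cost = cost @ p         # expected cost of each possible prediction
decision = int(np.argmin(expected_cost))
print(expected_cost, decision)
```

With these toy numbers the heavy FN penalty makes predicting SAS the lower-cost choice even though P(SAS) is only 0.2, which is exactly the risk-weighting behavior described above.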

Test Set Evaluation

Below are performance measures on a test set never touched during model building.

Next steps:

Generating a diversity of inputs:

Transfer of knowledge:

Build the same model using alt_output (means)

Remaining issues:

  1. Presence of NaNs and Infs
    • 3210 samples have NaNs sporadically throughout all features
    • 731 samples have Infs only for feature 7: (Tajima's D on the Y)
  2. How to generate a diverse, well-rounded dataset covering all scenarios possible?
    • Start playing with more complex cases, incorporation of 0.5 vs 0.5 allele frequency
  3. Fine-tuning the model to achieve better results: strong overfitting and poor generalizability
    • Reasons for a high generalization gap:
      • Different distributions: the validation and test sets might come from different distributions. Verify in the code that they are indeed sampled from the same process.
      • Number of samples: the validation and/or test set is too small, so their empirical data distributions differ too much, explaining the different reported accuracies. One example would be a dataset of thousands of images but also thousands of classes; the test set might then contain classes absent from the validation set (and vice versa). Use cross-validation to check whether the test accuracy is consistently lower than the validation accuracy, or whether the two just differ a lot in each fold.
      • Hyperparameter overfitting: also related to the size of the two sets. If hyperparameter tuning was done, check whether the accuracy gap existed before tuning, since the hyperparameters may have been "overfitted" to the validation set.
      • Loss function vs. accuracy: the model is trained on the loss function, so that is the most direct performance measure. If accuracy is only loosely coupled to the loss and the test loss is approximately as low as the validation loss, that might explain the accuracy gap; compare train, validation, and test losses, not just accuracies.
      • Bug in the code: if the test and validation sets are sampled from the same process and are sufficiently large, they are interchangeable, so their losses must be approximately equal. If the four points above check out, the next best guess is a bug, e.g. accidentally training the model on the validation set as well. Training on a larger dataset and checking whether the accuracies still diverge can help rule this out.
      • What I think: Test set too small
    • What metric do we want to use? (see modeling section for full list of possible metrics)
  4. Speeding up Scripts on TACC
    • 13-hour run time to generate 2000 samples on a single node; too long
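
For issue 1, a minimal sketch of flagging and dropping the records with NaNs or Infs (toy array below; in practice `X` would be the full simulated feature matrix):

```python
# Locate and drop rows containing any NaN or Inf.
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],     # NaN row (issue: NaNs across features)
              [4.0, np.inf],     # Inf row (issue: Infs in Tajima's D)
              [5.0, 6.0]])

bad = ~np.isfinite(X).all(axis=1)   # True for rows with any NaN or Inf
X_clean = X[~bad]
print(bad.sum(), X_clean.shape)
```

Whether to drop these rows, impute them, or fix the Tajima's D computation upstream is a separate modeling decision; this only shows how to identify them.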